Abstract

In many real-world problems, optimization decisions have to be
made with limited information. The decision maker may have no a
priori or a posteriori data about the often nonconvex objective
function except at a limited number of points that are obtained over
time through costly observations. This paper presents an optimization
framework that takes into account the information collection
(observation), estimation (regression), and optimization (maximization)
aspects in a holistic and structured manner. Explicitly quantifying
the information acquired at each optimization step using the entropy
measure from information theory, the (nonconvex) objective function
to be optimized (maximized) is modeled and estimated by adopting a
Bayesian approach and using Gaussian processes as a state-of-the-art
regression method. The resulting iterative scheme allows the decision
maker to solve the problem by expressing preferences for each aspect
quantitatively and concurrently.

1 Introduction 

In many real world problems, optimization decisions have to be made with 
limited information. Whether it is a static optimization or dynamic control 
problem, obtaining detailed and accurate information about the problem or 
system can often be a costly and time consuming process. In some cases, 
acquiring extensive information on system characteristics may be simply 
infeasible. In others, the observed system may be so nonstationary that by
the time the information is obtained, it is already outdated due to the
system's fast-changing nature. Therefore, the only option left to the
decision maker is to develop a strategy for collecting information
efficiently and to choose a model to estimate the "missing portions" of
the problem in order to solve it satisfactorily and according to a given
objective.

To make the discussion more concrete, consider the problem of maximizing
a (Lipschitz) continuous nonconvex objective function, which is unknown
except for its values at a small number of data points. The decision
maker may have no a priori information about the function and start with
zero data points. Furthermore, only a limited number of (possibly noisy)
observations may be available before making a decision on the maximum
value and its location. The function itself, however, remains unknown even
after the decision is made. What is the best strategy to address this problem?

The decision making framework presented in this paper captures the 
posed problem by taking into account the information collection (observa- 
tion), estimation (regression), and (multi-objective) optimization aspects in 
a holistic and structured manner. Hence, the framework enables the decision 
maker to solve the problem by expressing preferences for each aspect quan- 
titatively and concurrently. It explicitly incorporates many concepts that 
have been implicitly considered by heuristic schemes, and builds upon many 
results from seemingly disjoint but relevant fields such as information the- 
ory, machine learning, and optimization and control theories. Specifically, 
it combines concepts from these fields by 

• explicitly quantifying the information acquired using the entropy mea- 
sure from information theory, 

• modeling and estimating the (nonconvex) function or (nonlinear) sys- 
tem adopting a Bayesian approach and using Gaussian processes as a 
state-of-the-art regression method, 

• using an iterative scheme for observation, learning, and optimization, 

• capturing all of these aspects under the umbrella of a multi-objective 
"meta" optimization formulation. 

Although methods and approaches from machine (statistical) learning are
heavily utilized in this framework, the problem at hand is very different
from many classical machine learning problems, even in its learning aspect.
In most classical application domains of machine learning, such as data
mining, computer vision, or image and voice recognition, the difficulty is
often in handling a significant amount of data rather than a lack of it.
Many methods such as Expectation-Maximization (EM) inherently make this
assumption, with the exception of "active learning" schemes [3]. Information
theory plays an important role in evaluating scarce (and expensive) data and
developing strategies for obtaining it. Interestingly, data scarcity at the
same time converts the disadvantages of some methods into advantages, e.g.
the scalability problem of Gaussian processes.

It is worth noting that the class of problems described here is much
more frequently encountered in practice than it may first seem. For
example, the class of black-box methods known as "kriging" [10] has been
applied to such problems in geology and mining as well as in hydrology
since the mid-1960s. In addition, the solution framework proposed is
applicable to a wide variety of fields due to its fundamental nature. One
example is decentralized resource allocation decisions in networked and
complex systems, e.g. wired and wireless networks, where parameters change
quickly and global information on network characteristics is not available
at the local decision-making nodes. Another example is security-related
decisions where opponents spend a conscious effort to hide their actions. A
related area is security and information technology risk management in
large-scale organizations, where acquiring information on individual
subsystems and processes can be very costly. Yet another example application
is in biological systems where individual organisms or subsystems operate
autonomously (even if they are part of a larger system) under limited local
information.

2 Problem Definition and Approach 

A concrete definition of the motivating problem mentioned in the
introduction is helpful for describing the multiple aspects of the limited
information decision making framework. Without loss of generality, let

$$ \mathcal{X} \subset \mathbb{R}^d $$

be a nonempty, convex, and compact (closed and bounded) subset of the
original $d$-dimensional problem domain. The original domain does
not have to be convex, compact, or even fully known. However, adopting a
"divide and conquer" approach, the subset $\mathcal{X}$ provides a
reasonable starting point. Define next the objective function to be maximized

$$ f : \mathcal{X} \to \mathbb{R}, $$

which is unknown except at a finite number of (possibly imperfectly)
observed points. As a simplifying assumption, let $f$ be Lipschitz
continuous on $\mathcal{X}$. One of the main distinguishing characteristics
of this problem is the limitation on the set of observations

$$ \Omega_n := \{x_1, \ldots, x_n : x_i \in \mathcal{X} \ \forall i,\ n \geq 1\}, $$

due to the cost of obtaining information or the non-stationarity of the
underlying system. Assume for now that the cost of observing the value of
the objective function $f(x)$ is the same for any $x \in \mathcal{X}$. Then,
a basic search problem is defined as follows:

Problem 1 (Basic Search Problem) Consider a Lipschitz-continuous
objective function $f : \mathcal{X} \to \mathbb{R}$ on the $d$-dimensional
nonempty, convex, and compact set $\mathcal{X} \subset \mathbb{R}^d$. The
function is unknown except at a finite number of observed data points. What
is the best search strategy

$$ \Omega_N := \{x_1, \ldots, x_N : x_i \in \mathcal{X} \ \forall i,\ N \geq 1\} $$

that solves

$$ \max_{x \in \mathcal{X}} f(x) $$

for a given $N$?

The number of observations, $N$, in Problem 1 may be imposed by the
nature of the specific application domain. In many problems where there is
no time constraint, adopting an iterative (one-by-one) approach, and hence
choosing $N = 1$, is clearly beneficial as it allows the use of incoming new
information at each step. Alternatively, the assumption of equal
observation cost can be relaxed and formulated as a constraint

$$ \sum_{x \in \Omega_N} c_o(x) \leq C, $$

where $c_o(x) : \mathcal{X} \to \mathbb{R}$ is the observation cost function,
and the scalar $C$ is the total "exploration budget". It is also possible to
define this cost iteratively based on the (distance from the) previous
observation, e.g. $c_o(x_n, x_{n-1})$. In such cases, a location-based
iterative search scheme can be considered.

The simplest (both conceptually and computationally) strategy to solve
Problem 1 is random search on the domain $\mathcal{X}$. As such, no attempt
is made to "learn" the properties of the function $f$. Unless $f$ is
"algorithmically random" [14], which is rarely the case, this strategy wastes
the information collected on $f$. A slightly more complicated and very
popular set of strategies combines random search with simple modeling of the
function through gradient methods. In this case, the collected information is
used to model $f$ rudimentarily using derived gradients to "define slopes" in
a heuristic manner. Then, these slopes of $f$ are explored step-by-step in the
upward direction to find a local maximum, after which the search algorithm
randomly jumps to another location. It is also possible to randomize the
gradient climbing scheme for additional flexibility [24].

The framework presented in this paper takes one further step and
explicitly models the (entire) objective function $f$ (on the set
$\mathcal{X}$) using the information collected instead of heuristically
describing only the slopes. The function $\hat{f}$, which models,
approximates, and estimates $f$, belongs to a certain class of functions
$\mathcal{F}$ such that $\hat{f} \in \mathcal{F}$. The selection and
properties of this class are based on the "a priori" information available
and can be interpreted as the "world view" of the decision maker. These
properties can often be expressed using meta-parameters, which are then
updated based on the observations through a separate optimization process.
Likewise, a slower time-scale process can be used for model selection if
processing capabilities permit a multi-model approach.

This model-based search process, which lies at the center of the
framework, is fundamentally a manifestation of the Bayesian approach [18]. It
first imposes explicit and a priori modeling assumptions by choosing
$\hat{f}$ from a certain class of functions, $\mathcal{F}$, and then infers
(learns, updates) $\hat{f}$ in a structured manner as more information
becomes available through observations.

From a computational point of view, the decision making framework
with limited information lies at one end of the computation vs. observation
spectrum, while random search is at the opposite end. The framework tries
to utilize each piece of information to the maximum possible extent almost
regardless of the computational cost. The underlying assumption here is:
observation is very costly whereas computation is rather cheap.
This assumption is not only valid for a wide variety of problems from
different fields ranging from networking and security to economics and risk
management, but is also inspired by biological systems. In many biological
organisms, from single cells to human beings, operating close to this end
of the computation-observation spectrum is more advantageous than doing
random search.

When doing random search on the domain $\mathcal{X}$, at each stage, i.e.
given the previous observations, each remaining candidate data point provides
an equivalent amount of information. However, this is not the case when doing
model-based search. Depending on the model adopted and the information
previously collected, different unexplored points provide different amounts
of information. This information can be exactly quantified using the
definitions of entropy and information from the field of (Shannon)
information theory. Accordingly, the scalar quantity
$\mathcal{I}(\hat{f}, \Omega_n)$ denotes the aggregate information obtained
from the set of observations $\Omega_n$ within the model represented by
$\hat{f}$. A related issue is the reliability and possibly noisy nature of
observations, which will be discussed in further detail in the next section.

An extension of Problem 1 that captures the aspects discussed above is
defined next. 

Problem 2 (Model-based Search Problem) Let $f : \mathcal{X} \to \mathbb{R}$
be an objective function on the $d$-dimensional nonempty, convex, and compact
set $\mathcal{X} \subset \mathbb{R}^d$, which is unknown except at a finite
number of observed data points. Further let $\hat{f}(x)$ be an estimate of
the objective function obtained using an a priori model and observed data.
What is the best search strategy
$\Omega_N := \{x_1, \ldots, x_N : x_i \in \mathcal{X} \ \forall i,\ N \geq 1\}$
that solves the multi-objective problem with the following components?

• Objective 1: $\max_{\Omega_N} f(x)$ given $\hat{f}(x)$

• Objective 2: $\min_{\Omega_N} R(f, \hat{f})$

• Objective 3: $\max_{\Omega_N} \mathcal{I}(\hat{f}, \Omega_N)$

Here, $R(\cdot,\cdot)$ is a risk or expected loss function quantifying the
mismatch between the actual and estimated functions on the observation data
$\Omega_N$. The scalar quantity $\mathcal{I}$ is the aggregate information
obtained from the set of observations within the model represented by
$\hat{f}$. The cardinality of $\Omega_N$, $N$, can be either given, e.g.
$N = 1$, or defined through an additional constraint
$\sum_{x \in \Omega_N} c_o(x) \leq C$, where
$c_o(x) : \mathcal{X} \to \mathbb{R}$ is the observation cost function, and
the scalar $C$ is the total "exploration budget".

Figure 1: The three fundamental aspects of decision making with limited
information.

It is important to observe here that the three objectives defined in
Problem 2 are (almost) independent from and orthogonal to each other despite
being closely related. Objective 1 purely aims to maximize the unknown
objective function $f$ using the best estimate (model) $\hat{f}$. Objective 2
focuses on minimizing the error between the estimate $\hat{f}$ and the real
unknown function $f$ based on the observations made. Objective 3 tries to
maximize the amount of information provided by each (costly) observation or
experiment. It is worth noting that Objective 3 is formulated independently
of Objective 2; in other words, exploration is done independently of
estimation. In contrast, a balance between Objectives 1 and 2 is necessary to
ensure that the solution is robust. These objectives and the fundamental
aspects of decision making with limited information are visually depicted in
Figure 1.



Table 1: Fundamental Trade-offs

Exploration    versus    Exploitation
Observation    versus    Computation
Robustness     versus    Optimization



There are multiple trade-offs that are inherent to this problem, as listed
in Table 1. The first one, exploration versus exploitation, puts exploration,
i.e. obtaining more observations, against exploitation, i.e. trying to achieve
the given objective. Observation versus computation captures the trade-off
between building sophisticated models using the available information
to the fullest extent and making more observations. Robustness versus
optimization puts risk avoidance against optimization with respect to the
original objective as in exploitation.

3 Methodology 

This section presents the methods that are utilized within the framework,
which addresses the problem defined in the previous section. First, the
regression model and Gaussian Processes (GP) are presented. Subsequently,
modeling and measurement of information are discussed based on (Shannon)
information theory.

3.1 Regression and Gaussian Processes (GP)

Problem 2 presented in the previous section involves inferring or learning
the function $f$ using the set of observed data points. This is known as
the regression problem in machine learning and is a supervised learning
method since the observed data constitutes at the same time the learning
data set. This learning process involves selection of a "model", where the
learned function $\hat{f}$ is, for example, expressed in terms of a set of
parameters and specific basis functions, and at the same time minimization of
an error measure between the functions $f$ and $\hat{f}$ on the learning data
set. Gaussian processes (GP) provide a nonparametric alternative to this but
follow in spirit the same idea.

The main goal of regression involves a trade-off. On the one hand, it
tries to minimize the observed error between $f$ and $\hat{f}$. On the other,
it tries to infer the "real" shape of $f$ and make good estimations using
$\hat{f}$ even at unobserved points. If the former is overly emphasized, then
one ends up with "overfitting", which means $\hat{f}$ follows $f$ closely at
observed points but has weak predictive value at unobserved ones. This
delicate balance is usually achieved by balancing the prior "beliefs" on the
nature of the function, captured by the model (basis functions), and fitting
the model to the observed data.

This paper focuses on Gaussian Processes [23] as the chosen regression
method within the framework developed, without loss of generality.
There are multiple reasons behind this preference. Firstly, GP provides
an elegant mathematical method for easily combining many aspects of the
framework. Secondly, being a nonparametric method, GP eliminates any
discussion on model degree. Thirdly, it is easy to implement and understand
as it is based on well-known Gaussian probability concepts. Fourthly,
noise in observations is immediately taken into account if it is modeled as
Gaussian. Finally, one of the main drawbacks of GP, namely being
computationally heavy, does not really apply to the problem at hand since the
amount of data available is already very limited.

It is not possible to present here a comprehensive treatment of GP.
Therefore, a very rudimentary overview is provided next within the
context of the decision making problem. Consider a set of $M$ data points

$$ \mathcal{D} = \{x_1, \ldots, x_M\}, $$

where each $x_i \in \mathcal{X}$ is a $d$-dimensional vector, and the
corresponding vector of scalar values is $f(x_i)$, $i = 1, \ldots, M$. Assume
that the observations are distorted by zero-mean Gaussian noise
$n \sim \mathcal{N}(0, \sigma)$ with variance $\sigma$. Then, the resulting
observations form a Gaussian vector
$y = f(x) + n \sim \mathcal{N}(f(x), \sigma)$.

A GP is formally defined as a collection of random variables, any finite
number of which have a joint Gaussian distribution. It is completely
specified by its mean function $m(x)$ and covariance function
$C(x, \tilde{x})$, where

$$ m(x) = E[f(x)] \quad \text{and} \quad C(x, \tilde{x}) = E[(f(x) - m(x))(f(\tilde{x}) - m(\tilde{x}))], \quad \forall x, \tilde{x}. $$

Let us for simplicity choose $m(x) = 0$. Then, the GP is characterized
entirely by its covariance function $C(x, \tilde{x})$. Since the noise in the
observation vector $y$ is also Gaussian, the covariance function can be
defined as the sum of a kernel function $Q(x, \tilde{x})$ and the diagonal
noise variance

$$ C(x, \tilde{x}) = Q(x, \tilde{x}) + \sigma I, \quad \forall x, \tilde{x} \in \mathcal{D}, \tag{1} $$

where $I$ is the identity matrix. While it is possible to choose here any
(positive definite) kernel $Q(\cdot,\cdot)$, one classical choice is

$$ Q(x, \tilde{x}) = \exp\left[ -\frac{1}{2} \|x - \tilde{x}\|^2 \right]. \tag{2} $$


Note that GP makes use of the well-known kernel trick here by representing 
an infinite dimensional continuous function using a (finite) set of continuous 
basis functions and associated vector of real parameters in accordance with 
the representer theorem [26]. 

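The (noisy)^1 training set $(\mathcal{D}, y)$ is used to define the
corresponding GP, $\mathcal{GP}(0, C(\mathcal{D}))$, through the $M \times M$
covariance matrix $C(\mathcal{D}) = Q + \sigma I$, where the conditional
Gaussian distribution of the value $\bar{y}$ at any point outside the
training set, $x \in \mathcal{X}$, $x \notin \mathcal{D}$, given the training
data $(\mathcal{D}, y)$, can be computed as follows. Define the vector

$$ k(x) = [Q(x_1, x), \ldots, Q(x_M, x)] \tag{3} $$

and scalar

$$ \kappa = Q(x, x) + \sigma. \tag{4} $$

Then, the conditional distribution $p(\bar{y} \mid y)$ that characterizes the
$\mathcal{GP}(0, C)$ is a Gaussian $\mathcal{N}(\hat{f}, v)$ with mean
$\hat{f}$ and variance $v$,

$$ \hat{f}(x) = k^T C^{-1} y \quad \text{and} \quad v(x) = \kappa - k^T C^{-1} k. \tag{5} $$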

^1 The special case of perfect observation without noise is handled the same
way as long as the kernel function $Q(\cdot,\cdot)$ is positive definite.

This is a key result that defines GP regression as the mean function
$\hat{f}(x)$ of the Gaussian distribution and provides a prediction of the
objective function $f(x)$. At the same time, it belongs to the well-defined
class $\hat{f} \in \mathcal{F}$, which is the set of all possible sample
functions of the GP

$$ \mathcal{F} := \{\hat{f}(x) : \mathcal{X} \to \mathbb{R} \text{ such that } \hat{f} \in \mathcal{GP}(0, C(\mathcal{D})),\ \forall \mathcal{D}, C\}, $$

where $C(\mathcal{D})$ is defined in (1) and $\mathcal{GP}$ through (3), (4),
and (5) above. Furthermore, the variance function $v(x)$ can be used to
measure the uncertainty level of the predictions provided by $\hat{f}$, which
will be discussed in the next subsection.
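
A small numerical sketch may make the prediction step concrete. Assuming NumPy is available and using the squared-exponential kernel of this subsection, the hypothetical helper `gp_predict` below evaluates the mean $k^T C^{-1} y$ and the variance $\kappa - k^T C^{-1} k$ at one new point. The function names, the default noise variance, and the array layout (one row per data point) are illustrative assumptions, not part of the framework itself.

```python
import numpy as np


def sq_exp_kernel(a, b):
    """Kernel Q(x, x~) = exp(-0.5 * ||x - x~||^2) between two sets of row vectors."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)


def gp_predict(X_train, y_train, x_new, noise_var=1e-2):
    """Return (mean, variance) of the zero-mean GP prediction at a single point.

    X_train: M points, one per row; y_train: M noisy observations;
    noise_var: assumed noise variance (sigma in the text).
    """
    X_train = np.asarray(X_train, dtype=float).reshape(len(X_train), -1)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float).reshape(1, -1)

    M = X_train.shape[0]
    C = sq_exp_kernel(X_train, X_train) + noise_var * np.eye(M)  # C(D) = Q + sigma I
    k = sq_exp_kernel(X_train, x_new)[:, 0]                      # vector k(x)
    kappa = sq_exp_kernel(x_new, x_new)[0, 0] + noise_var        # scalar kappa

    mean = k @ np.linalg.solve(C, y_train)      # k^T C^{-1} y
    var = kappa - k @ np.linalg.solve(C, k)     # kappa - k^T C^{-1} k
    return float(mean), float(var)


# Usage: three observations of a hypothetical 1-D objective.
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X[:, 0])
print(gp_predict(X, y, [1.0]))   # small variance close to the data
print(gp_predict(X, y, [10.0]))  # variance near 1 + noise_var far from the data
```

Solving the linear system with `np.linalg.solve` instead of forming $C^{-1}$ explicitly is a numerical convenience only; for the small data sets considered here either choice would work.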

3.2 Quantifying Information in Observations 

In the framework presented, each observation provides a data point to the
regression problem (estimating $f$ by constructing $\hat{f}$) as discussed in
the previous subsection. Many works in the learning literature consider the
"training" data used in regression as available (all at once or sequentially)
and do not discuss the possibility of the decision maker influencing or even
optimizing the data collection process. The active learning problem defined
in Section 2 requires, however, addressing exactly the question of "how to
quantify the information obtained and optimize the observation process?".
Following the approach discussed in [17, 18], the framework here provides a
precise answer to this question.

Making any decision on the next (set of) observations in a principled 
manner necessitates first measuring the information obtained from each ob- 
servation within the adopted model. It is important to note that the infor- 
mation measure here is dependent on the chosen model. For example, the 
same observation provides a different amount of information to a random 
search model than a GP one. 

Shannon information theory readily provides the necessary mathematical
framework for measuring the information content of a variable. Let
$p$ be a probability distribution over the set of possible values of a
discrete random variable $A$. The entropy of the random variable is given
by $H(A) = \sum_i p_i \log_2(1/p_i)$, which quantifies the amount of
uncertainty. Then, the information obtained from an observation on the
variable, i.e. the reduction in uncertainty, can be quantified simply by
taking the difference of its initial and final entropy,

$$ \mathcal{I} = H_0 - H_1. $$

It is important here to avoid the common conceptual pitfall of equating
entropy to information itself, as is sometimes done in the communication
theory literature.^2 Within this framework, (Shannon) information is defined
as a measure of the decrease of uncertainty after (each) observation (within
a given model). This can be best explained with the following simple example.

3.2.1 Example: Bisection 

Choose a number between 1 and 64 randomly with uniform probability
(prior). What is the best search strategy for finding this number? Let
the random variable $A$ represent this number. In the beginning, the entropy
of $A$ is

$$ H_0 = \sum_{i=1}^{64} \frac{1}{64} \log_2(64) = 6. $$

The information maximization problem is defined as

$$ \max \mathcal{I} = \max H_0 - H_1 = \min H_1, $$

since $H_0$, the entropy before the action (obtaining information), is
constant. The entropy $H_1$ is the one after information is obtained, and
hence is directly affected by the specific action chosen. Now, define the
action as setting a threshold $1 < t < 64$ to check whether the chosen number
is less or higher than this threshold $t$. To simplify the analysis, consider
a continuous version of the problem by defining $p$ as the probability of the
chosen number being less than the threshold. Thus, in this uniform prior
case, the problem simplifies to

$$ \min_p H_1 = \min_p \; p \log(p) + (1 - p) \log(1 - p), $$

which has the derivative

$$ \frac{dH_1}{dp} = \log(p) - \log(1 - p). $$

Clearly, the threshold $p^* = 0.5$ is the global minimum, which roughly
corresponds to $t^* = 32$ (ignoring quantization and boundary effects). Thus,
bisection from the middle is the optimal search strategy for the uniform
prior. In this example, the number can be found in the worst case in 6 steps,
each providing one bit of information. Nonuniform probabilities (priors) can
be handled in a similar way.

^2 Since this issue is not of great importance for the class of problems
considered in communication theory, it is often ignored. However, the
difference is of conceptual importance in this problem. See
http://www.ccrnp.ncifcrf.gov/~toms/information.is.not.uncertainty.html for
a detailed discussion.

If this search process (bisection) is repeatedly applied without any
feedback, then it results in the optimal quantization of the search space
both in the uniform case above and for nonuniform probabilities. If feedback
is available, i.e. one learns after each bisection whether the number is
larger or smaller than the boundary, then this is, as shown, the best search
strategy.
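
The bisection example can be replayed numerically. The sketch below, which uses only the Python standard library, tracks the entropy (in bits) of the uniform posterior over the remaining candidates after each threshold question. The helper names `entropy_bits` and `bisection_search` are hypothetical, and the feedback is assumed to be noiseless.

```python
import math


def entropy_bits(probs):
    """Shannon entropy H = sum_i p_i * log2(1 / p_i), skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def bisection_search(secret, low=1, high=64):
    """Locate `secret` in [low, high] with threshold questions.

    Returns (guess, history), where history[i] is the entropy of the uniform
    posterior over the remaining candidates after question i + 1.
    """
    history = []
    while low < high:
        mid = (low + high) // 2      # threshold splitting the candidates in half
        if secret <= mid:            # feedback: is the number at most the threshold?
            high = mid
        else:
            low = mid + 1
        remaining = high - low + 1
        history.append(entropy_bits([1.0 / remaining] * remaining))
    return low, history


# Usage: the prior entropy is 6 bits and each question removes one bit.
print(entropy_bits([1.0 / 64] * 64))   # 6.0
print(bisection_search(37))            # (37, [5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
```

Each question yields $\mathcal{I} = H_0 - H_1 = 1$ bit, so six questions exhaust the six bits of initial uncertainty regardless of the chosen number.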



4 Model 

The model adopted in the framework for decision making with limited
information builds on the methods presented in the previous section and
addresses the problem introduced in Section 2. The model consists of three
main parts: observation, update of GP for regression, and optimization to
determine the next action. These three steps, shown in Figure 2, are taken
iteratively to achieve the objectives in Problem 2. As a result of its
iterative nature, this approach can be considered in a sense similar to the
well-known Expectation-Maximization algorithm [3].



[Figure 2: block diagram. Labels: Goals and Criteria; Model Update, GP
Regression; Estimated; Function or System; Multi-Objective Optimization;
Action; Noise; Observed value; Observation; Data point.]

Figure 2: The main parts of the underlying model of the decision making
framework.

Observations, given that they are a scarce resource in the class of
problems considered, play an important role in the model. Uncertainties in the
observed quantities can be modeled as additive noise. Likewise, properties
(variance or bias) of additive noise can be used to model the reliability of
(and bias in) the data points observed. GPs provide a straightforward
mathematical structure for incorporating these aspects into the model under
some simplifying assumptions.

The set of observations collected provides the (supervised) training data
for GP regression in order to estimate the characteristics of the function or
system at hand. This process relies on the GP methods described in
Subsection 3.1. Thus, at each iteration an up-to-date description of the
function or system is obtained based on the latest observations.
Specifically, $\hat{f}$ provides an estimate of $f$.^3 Assuming an additive
Gaussian noise model, the noise variance $\sigma$ can be used to model
uncertainties, e.g. older and noisy data resulting in higher $\sigma$ values.

^3 See [23, Chap. 7.2] for a discussion on asymptotic analysis of GP
regression. It should be noted, however, that asymptotic properties are of
little relevance to the problem at hand.

The final and most important part of the model provides a basis for
determining the next action after an optimization process that takes into
account all three objectives in Problem 2. The information aspect of these
objectives is already discussed in Subsection 3.2. An important issue here is
the fact that there are infinitely many candidate points in this optimization
process, but in practice only a finite collection of them can be evaluated.

4.1 Sampling Solution Candidates

When making a decision on the next action through multi-objective
optimization, there are (infinitely) many candidate points. A pragmatic
solution to the problem of finding solution candidates is to (adaptively)
sample the problem domain $\mathcal{X}$ to obtain the set

$$ \Theta := \{x_1, \ldots, x_T : x_i \in \mathcal{X},\ x_i \notin \mathcal{D},\ \forall i\} $$

that does not overlap with known points. In low (one or two) dimensions,
this can be easily achieved through grid sampling methods. In higher
dimensions, (Quasi) Monte Carlo schemes can be utilized. For large problem
domains, the current domain of interest $\mathcal{X}$ can be defined around
the last or most promising observation in such a way that such a sampling is
computationally feasible. Likewise, multi-resolution schemes can also be
deployed to increase computational efficiency.

Although such a solution may seem restrictive at first glance, it is in spirit
not very different from other schemes such as simulated annealing, which are
widely used to address nonconvex optimization problems. However, a major
difference between this and other schemes is the fact that the candidate
sampling and evaluation are done here "a priori" due to experimentation
being costly, while other methods rely on an abundance of information.
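
As an illustration of candidate sampling, the following sketch draws uniformly distributed points from an axis-aligned box and discards any that coincide with already observed points. The box-shaped domain, the helper name `sample_candidates`, and the tolerance `min_dist` are assumptions made for this example; plain pseudo-random sampling stands in here for the grid or (quasi) Monte Carlo schemes mentioned in the text.

```python
import math
import random


def sample_candidates(bounds, known_points, num_samples, min_dist=1e-9, seed=0):
    """Sample `num_samples` candidate points from a box, avoiding known points.

    bounds: list of (low, high) pairs, one per dimension (assumed box domain).
    known_points: already observed points (the data set D in the text).
    min_dist: candidates closer than this to a known point are rejected.
    """
    rng = random.Random(seed)
    known = [tuple(p) for p in known_points]
    candidates = []
    while len(candidates) < num_samples:
        x = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        # Keep the point only if it does not overlap with an observed one.
        if all(math.dist(x, p) > min_dist for p in known):
            candidates.append(x)
    return candidates


# Usage: 5 candidates in the unit square, given one earlier observation.
print(sample_candidates([(0.0, 1.0), (0.0, 1.0)], [(0.5, 0.5)], 5))
```

Restricting `bounds` to a neighborhood of the last or most promising observation gives a simple way to keep the sample size manageable on large domains.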

A natural question that arises is whether and under what conditions
such a sampling method gives satisfactory results. The following result
from [30, 31] provides an answer to this question in terms of the number of
samples required.

Theorem 1 Define a multivariate function $f(x)$ on the convex, compact
set $\mathcal{X}$, which admits the maximum
$x^* = \arg\max_{x \in \mathcal{X}} f(x)$. Based on a set of $N$ random
samples $\Theta = \{x_1, \ldots, x_N : x_i \in \mathcal{X} \ \forall i\}$ from
the entire set $\mathcal{X}$, let $\hat{x} := \arg\max_{x \in \Theta} f(x)$ be
an estimate of the maximum $x^*$.

Given an $\varepsilon > 0$ and $\delta > 0$, the minimum number of random
samples $N$ which guarantees that

$$ \Pr\left( \Pr[f(x^*) > f(\hat{x})] \leq \varepsilon \right) \geq 1 - \delta, $$

i.e. the probability that 'the probability of the real maximum surpassing the
estimated one being less than $\varepsilon$' is larger than $1 - \delta$, is

$$ N \geq \frac{\ln(1/\delta)}{\ln(1/(1 - \varepsilon))}. $$

Furthermore, this bound is tight if the function $f$ is continuous on
$\mathcal{X}$.

It is interesting and important to note that this bound is independent of the
sampling distribution used (as long as it covers the whole set $\mathcal{X}$
with nonzero probability), the function $f$ itself, as well as the properties
and dimension of the set $\mathcal{X}$.
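
To get a feel for the numbers, the sample-size bound can be evaluated directly. The sketch below assumes the bound has the standard form $N \geq \ln(1/\delta) / \ln(1/(1-\varepsilon))$ from the randomized-algorithms literature and rounds up to the next integer; the helper name `min_samples` is hypothetical.

```python
import math


def min_samples(eps, delta):
    """Smallest integer N satisfying N >= ln(1/delta) / ln(1/(1 - eps))."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must both lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 / delta) / math.log(1.0 / (1.0 - eps)))


# Usage: accuracy eps = 0.05 with confidence 1 - delta = 0.99.
print(min_samples(0.05, 0.01))   # 90
```

The bound is equivalent to requiring $(1 - \varepsilon)^N \leq \delta$, i.e. the probability that all $N$ independent samples miss the best $\varepsilon$-fraction of the domain is at most $\delta$, which is why neither the dimension nor the function itself appears in it.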



4.2 Quantifying Information in GP 

The information measurement and GP approaches in Section 3 can be
directly combined. Let the zero-mean multivariate Gaussian (normal)
probability distribution be denoted as

$$ p(x) = \frac{1}{\sqrt{(2\pi)^d |C_p(x)|}} \exp\left( -\frac{1}{2} [x - m]^T C_p^{-1}(x) [x - m] \right), \quad x \in \mathcal{X}, \tag{6} $$

where $|\cdot|$ is the determinant, $m$ is the mean (vector) as defined in
(5), and $C_p(x)$ is the covariance matrix as a function of the newly observed
point $x \in \mathcal{X}$ given by

$$ C_p(x) = \begin{bmatrix} C(\mathcal{D}) & k^T(x) \\ k(x) & \kappa \end{bmatrix}. \tag{7} $$

Here, the vector $k(x)$ is defined in (3) and $\kappa$ in (4), respectively.
The matrix $C(\mathcal{D})$ is the covariance matrix based on the training
data $\mathcal{D}$ as defined in Subsection 3.1.
The entropy of the multivariate Gaussian distribution (6) is [1]

$$ H(x) = \frac{d}{2} + \frac{d}{2} \ln(2\pi) + \frac{1}{2} \ln |C_p(x)|, $$

where $d$ is the dimension. Note that this is the entropy of the GP estimate
at the point $x$ based on the available data $\mathcal{D}$. The aggregate
entropy of the function on the region $\mathcal{X}$ is given by

$$ H^{agg} := \int_{x \in \mathcal{X}} \frac{1}{2} \ln |C_p(x)| \, dx. \tag{8} $$



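The problem of choosing a new data point $\tilde{x}$ such that the
information obtained from it within the GP regression model is maximized can
be formulated in a way similar to the one in the bisection example:

$$ \tilde{x} = \arg\max_{\tilde{x}} \mathcal{I} = \arg\max_{\tilde{x}} \int_{x \in \mathcal{X}} [H_0 - H_1] \, dx = \arg\min_{\tilde{x}} \int_{x \in \mathcal{X}} \frac{1}{2} \ln |C_q(x, \tilde{x})| \, dx, \tag{9} $$

where the integral is computed over all $x \in \mathcal{X}$, and the
covariance matrix $C_q(x, \tilde{x})$ is defined as

$$ C_q(x, \tilde{x}) = \begin{bmatrix} C(\mathcal{D}) & k^T(\tilde{x}) & k^T(x) \\ k(\tilde{x}) & \tilde{\kappa} & Q(x, \tilde{x}) \\ k(x) & Q(x, \tilde{x}) & \kappa \end{bmatrix}, \tag{10} $$

and $\tilde{\kappa} = Q(\tilde{x}, \tilde{x}) + \sigma$. Here,
$C(\mathcal{D})$ is an $M \times M$ matrix and $C_q$ is an
$(M + 2) \times (M + 2)$ one, whereas $\kappa$ and $Q(x, \tilde{x})$ are
scalars and $k$ is an $M \times 1$ vector. This result is summarized in the
following proposition.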

Proposition 1 As a maximum information data collection strategy for a
Gaussian Process with a covariance matrix $C(\mathcal{D})$, the next
observation $\tilde{x}$ should be chosen in such a way that

$$ \tilde{x} = \arg\max_{\tilde{x}} \mathcal{I} = \arg\min_{\tilde{x}} \int_{x \in \mathcal{X}} \ln |C_q(x, \tilde{x})| \, dx, $$

where $C_q(x, \tilde{x})$ is defined in (10).
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An Approximate Solution to Information Maximization 

Given a set of (candidate) points $\Theta$ sampled from $X$, the result in Proposition 1 can be revisited. The problem in (9) is then approximated [31] by

$$\max_{\tilde{x}} I \approx \min_{\tilde{x}} \sum_{x \in \Theta} \ln |C_q(x, \tilde{x})| \;\Rightarrow\; \tilde{x} = \arg\min_{\tilde{x}} \prod_{x \in \Theta} |C_q(x, \tilde{x})|, \tag{11}$$

using the monotonicity property of the natural logarithm and the fact that the determinant of a covariance matrix is non-negative. Thus, the following counterpart of Proposition 1 is obtained:

Proposition 2 As an approximately maximum information data collection strategy for a Gaussian Process with a covariance matrix $C(\mathcal{D})$ and given a collection of candidate points $\Theta$, the next observation $\tilde{x} \in \Theta$ should be chosen in such a way that

$$\tilde{x} = \arg\min_{\tilde{x}} \prod_{x \in \Theta} |C_q(x, \tilde{x})| \approx \arg\max_{\tilde{x}} I,$$

where $C_q(x, \tilde{x})$ is given in (10).

Although it is an approximation, finding a solution to the optimization problem in Proposition 2 can still be computationally costly. Therefore, a greedy algorithm is proposed as a computationally simpler alternative. Let $x^* \in \Theta$ be defined as

$$x^* := \arg\max_{\tilde{x} \in \Theta} |C_p(\tilde{x})| = \arg\max_{\tilde{x} \in \Theta} |C(\mathcal{D})| \, |\kappa(\tilde{x}) - k(\tilde{x}) C^{-1}(\mathcal{D}) k^T(\tilde{x})|,$$

where the matrix $C_p$ is given by (7) [21]. The first term above, $|C(\mathcal{D})|$, is fixed and the second one,

$$|\kappa(\tilde{x}) - k(\tilde{x}) C^{-1}(\mathcal{D}) k^T(\tilde{x})|,$$

is the same as the GP variance $v(\tilde{x})$ in (5). Hence, the sample $x^*$ is one of those with the maximum variance in the set $\Theta$, given current data $\mathcal{D}$.
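The greedy rule can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions rather than the paper's code: the kernel `rbf`, its length scale, the noise level, and the sample data are hypothetical. The final lines check the block-determinant identity $|C_p(\tilde{x})| = |C(\mathcal{D})| \, v(\tilde{x})$ numerically.

```python
import numpy as np

def rbf(a, b, length=0.5):
    """Gaussian kernel for 1-D inputs (assumed form and length scale)."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-0.5 * (a - b) ** 2 / length**2)

def gp_variance(X, candidates, noise=1e-2):
    """v(x) = kappa - k(x) C^{-1} k(x)^T for every candidate x."""
    C = rbf(X, X) + noise * np.eye(len(X))
    K = rbf(candidates, X)                    # one row k(x) per candidate
    kappa = 1.0 + noise                       # Q(x, x) + noise for this kernel
    return kappa - np.einsum("ij,ji->i", K, np.linalg.solve(C, K.T))

X = np.array([0.5, 1.0, 2.5])                 # observed locations
candidates = np.linspace(0.0, 4.0, 41)        # candidate set
v = gp_variance(X, candidates)
x_star = candidates[np.argmax(v)]             # greedy maximum-variance choice

# Numerical check of |C_p(x*)| = |C(D)| * v(x*).
C = rbf(X, X) + 1e-2 * np.eye(len(X))
k = rbf([x_star], X)
Cp = np.block([[C, k.T], [k, np.array([[1.0 + 1e-2]])]])
lhs = np.linalg.det(Cp)
rhs = np.linalg.det(C) * v.max()
```

With these sample values the selected candidate is the grid point farthest from all observations, which matches the intuition behind the maximum variance rule.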

It follows from (10) and basic matrix theory that if $\tilde{x} = x$ for a given $x$, then $|C_q(x, \tilde{x})|$ is minimized. As a simplification, ignore the dependencies between the $C_q(x, \tilde{x})$ matrices for different $x \in \Theta$. Then, choosing the maximum variance $\tilde{x}$ as

$$\tilde{x} = \arg\max_{\tilde{x} \in \Theta} v(\tilde{x})$$

leads to a large (possibly largest) reduction in $\prod_{x \in \Theta} |C_q(x, \tilde{x})|$, and hence provides a rough approximate solution to (11) and to the result in Proposition 2. This result is consistent with widely-known heuristics such as "maximum entropy" or "minimum variance" methods [28], and a variant has been discussed in [17].

Proposition 3 Given a Gaussian Process with a covariance matrix $C(\mathcal{D})$ and a collection of candidate points $\Theta$, an approximate solution to the maximum information data collection problem defined in Proposition 1 is to choose the sample point(s) $\tilde{x}$ in such a way that it has (they have) the maximum variance within the set $\Theta$.

5 Optimization with Limited Information 

Let $f : X \to \mathbb{R}$ be the unknown Lipschitz-continuous function of interest
on the $d$-dimensional nonempty, convex, and compact set $X \subset \mathbb{R}^d$. The
amount of information about this function available to the decision maker is 
limited to a finite number of possibly noisy observations. Since the observa- 
tions are costly, the goal of the decision maker is to find the maximum of /, 
estimate / as accurately as possible using available observations, and select 
the most informative data points, at the same time. This naturally calls 
for an iterative and myopic optimization procedure since each new observa- 
tion provides a new data point that concurrently affects the maximization, 
function estimation (regression), and information quantity. 

The first and basic objective is the maximization of the function $f(x)$ on $x \in X$. As a simplification, observations are assumed to be sequential, one at a time. Since $f$ is basically unknown, this problem has to be formulated as

$$\tilde{x} = \arg\max_{x \in X} F_1(x) = \arg\max_{x \in X} \hat{f}(x),$$

where $\hat{f}$ is the best estimate obtained through GP regression (5) using the current data set $\mathcal{D}$. Data uncertainty (observation errors) is modeled through additive Gaussian noise with variance $\sigma$ as a first approximation.

The second objective is to minimize the difference (estimation error) between $f$ and $\hat{f}$. Define $e(x) = f(x) - \hat{f}(x)$, $\forall x \in X$. Given the set of noisy observations

$$\{ f(x_i) + n(x_i) : x_i \in \mathcal{D}, \ \forall i \},$$

where $n \sim \mathcal{N}(0, \sigma)$ denotes zero mean Gaussian noise, it is possible to use another GP regression (5) to estimate this error function, $\hat{e}(\mathcal{D}, x)$, on the entire set $X$. Thus, the second objective is to ensure that the next observation $\tilde{x}$ solves

$$\tilde{x} = \arg\min_{\tilde{x}} F_2(\tilde{x}),$$

where $F_2$ aggregates the magnitude of the estimated error, $|\hat{e}|$. Note that $F_2$ here corresponds to a risk or loss estimate function.

The third objective is to maximize the amount of information obtained with each observation $\tilde{x}$, or

$$\tilde{x} = \arg\max_{\tilde{x}} F_3(\tilde{x}) = \arg\max_{\tilde{x}} I(\tilde{x}, \hat{f}) = \arg\min_{\tilde{x}} \int_{x \in X} \frac{1}{2} \ln |C_q(x, \tilde{x})| \, dx,$$

given the best estimate of the original function, $\hat{f}$. This objective has already been discussed in Section 3.2 in detail.

The values of the three objectives, $F_1$, $F_2$, $F_3$, cannot be evaluated numerically on the entire set $X$. Therefore, a sampling method is used as described in Section 4 to obtain a set of solution candidates $\Theta$, which replaces $X$ in the maximization and minimization problems above. Next, specific problem formulations are presented based on such a sampling of solution candidates. The overall structure of the framework is visualized in Figure 3.

5.1 Solution Approaches 

The most common approach to multi-objective optimization is the weighted sum method [19, 9]. The three objectives discussed above can be combined to obtain a single objective using the respective weights $[w_1, w_2, w_3]$, $\sum_{i=1}^{3} w_i = 1$, $0 \le w_i \le 1 \ \forall i$. Assuming a single data point is chosen from $\Theta$ and observed among the candidates at each step, i.e. $\tilde{x} \in \Theta$, a specific weighted sum formulation to address Problem 2 is obtained.

Proposition 4 The solution, $\tilde{x} \in \Theta$, to the optimization problem

$$\max_{\tilde{x} \in \Theta} F(\tilde{x}) = w_1 \hat{f}(\tilde{x}) - w_2 \frac{1}{N} \sum_{x \in \Theta} |\hat{e}(x, \mathcal{D}, \tilde{x})| + w_3 I(\tilde{x}, \hat{f}), \tag{12}$$

constitutes the best search strategy for this weighted sum formulation of Problem 2.

Figure 3: The decision making framework for static optimization with limited information.

As discussed in Subsection 3.2 and stated in Proposition 3, the information objective, $F_3$, in (12) can be approximated by substituting it with the GP variance $v(\tilde{x})$ in (5) to decrease computational load. Thus, an approximation to the solution in Proposition 4 is:

Proposition 5 The solution, $\tilde{x} \in \Theta$, to the optimization problem

$$\max_{\tilde{x} \in \Theta} F(\tilde{x}) = w_1 \hat{f}(\tilde{x}) - w_2 \frac{1}{N} \sum_{x \in \Theta} |\hat{e}(x, \mathcal{D}, \tilde{x})| + w_3 v(\tilde{x}), \tag{13}$$

where $v(\tilde{x})$ is defined in (5), approximates the search strategy in Proposition 4.

The weighting scheme described is only meaningful if the three objectives are of the same order of magnitude. Therefore, the original objective functions, $F_i$, $i = 1, 2, 3$, have to be transformed or "normalized". There are many different approaches to perform such a transformation [19, 9]. The most common one, which coincidentally is known as normalization, aims to map each objective function to a predefined interval, e.g. $[0, 1]$. To do this, first estimate an upper bound $F_i^U$ and a lower bound $F_i^L$ on each individual objective $F_i(x)$. Then, the $i$-th normalized objective is

$$F_i^N(x) = \frac{F_i(x) - F_i^L}{F_i^U - F_i^L}.$$

The main issue in normalization is to determine the appropriate upper and lower bounds, which is very problem-dependent. In the case of Proposition 5, the estimated functions $\hat{f}$ and $\hat{e}$ on the set $\Theta$ as well as the existing observations $\mathcal{D}$ can be utilized to obtain these values. The specific bounds for the respective objectives, $F_1^U = \max_{x \in \Theta} \hat{f}(x)$, $F_1^L = \min_{x \in \Theta} \hat{f}(x)$, $F_2^U = \max_{x \in \Theta} |\hat{e}(x, \mathcal{D})|$, $F_2^L = 0$, $F_3^U = \max_{x \in \Theta} \kappa(x)$, and $F_3^L = 0$, provide a suitable starting estimate and can be combined with a priori domain knowledge if necessary. Thus, a normalized version of the formulation in Proposition 5 is obtained.

Proposition 6 The solution, $\tilde{x} \in \Theta$, to the optimization problem

$$\max_{\tilde{x} \in \Theta} F(\tilde{x}) = \frac{w_1}{\Delta_1} \left( \hat{f}(\tilde{x}) - F_1^L \right) - \frac{w_2}{N \Delta_2} \sum_{x \in \Theta} |\hat{e}(x, \mathcal{D}, \tilde{x})| + \frac{w_3}{\Delta_3} v(\tilde{x}), \tag{14}$$

where $\Delta_i = F_i^U - F_i^L$, $i = 1, 2, 3$, provides an approximation to the best search strategy for solving the normalized weighted-sum formulation of Problem 2.
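The normalization and weighting steps can be sketched on a handful of candidates. The numbers below are made-up illustrative values for the three objectives at four candidate points, and `normalize` is a hypothetical helper; only the min-max formula and the choice of bounds follow the text.

```python
import numpy as np

def normalize(F, lower, upper):
    """Min-max normalization (F - F^L) / (F^U - F^L)."""
    return (np.asarray(F, dtype=float) - lower) / (upper - lower)

# Illustrative objective values at four candidates (not from the paper).
f_hat = np.array([0.2, 1.5, -0.7, 0.9])     # F1: estimated function
err = np.array([0.30, 0.10, 0.05, 0.20])    # F2: estimated error magnitude
var = np.array([0.8, 0.1, 0.6, 0.4])        # F3: GP variance
kappa = 1.0                                 # assumed upper bound on the variance
w = (1.0, 1.0, 1.0)                         # weights as in Example 1

F1n = normalize(f_hat, f_hat.min(), f_hat.max())
F2n = normalize(err, 0.0, err.max())
F3n = normalize(var, 0.0, kappa)

score = w[0] * F1n - w[1] * F2n + w[2] * F3n
best = int(np.argmax(score))                # index of the next observation
```

After normalization each term lies in $[0, 1]$, so the weights express relative preferences rather than compensating for different scales.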

The bounded objective function method provides a suitable alternative to the weighted sum formulation above in addressing the multi-objective problem defined. The bounded objective function method optimizes (here, maximizes) the single most important objective, in this case $F_1(x)$, while the other two objective functions $F_2(x)$ and $F_3(x)$ are converted to form additional constraints. Such constraints are in a sense similar to QoS ones that naturally exist in many real life problems [20, 21, 29]. As an advantage, in the bounded objective formulation there is no need for normalization.

The bounded objective counterpart of the result in Proposition 5 is as follows.

Proposition 7 The solution, $\tilde{x} \in \Theta$, to the constrained optimization problem

$$\begin{aligned} & \max_{\tilde{x} \in \Theta} \hat{f}(\tilde{x}) \\ & \text{such that } 0 \le F_2(\tilde{x}) = \frac{1}{N} \sum_{x \in \Theta} |\hat{e}(x, \mathcal{D}, \tilde{x})| \le b_1, \\ & \text{and } 0 \le F_3(\tilde{x}) = v(\tilde{x}) \le b_2, \end{aligned} \tag{15}$$

where $b_1$ and $b_2$ are given (predetermined) scalar bounds on $F_2$ and $F_3$, respectively, provides an approximate best search strategy for a bounded-objective formulation of Problem 2.

The advantage of the bounded objective function method is that it provides a bound on the information collection and estimation objectives while maximizing the estimated function. This leads in practice to an initial emphasis on information collection and correct estimation of the objective function. In that sense, the method is more "classical", i.e. it follows the common method of learn first and maximize later. Furthermore, it does not require normalization, i.e. it is easier to deploy. The method has, however, a significant disadvantage which makes its usage prohibitive. In large-scale or high-dimensional problems, the space to explore to satisfy any bound on information is simply immense. Therefore, one does not have the luxury of identifying the function first to maximize it later as it would take too many samples to do this. In such cases, it makes more sense to deploy the weighted sum method, possibly along with a cooling scheme that modifies the weights to balance depth-first vs. breadth-first search.
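The constrained selection of Proposition 7 reduces, on a finite candidate set, to filtering by the two bounds and maximizing the estimate over what remains. The sketch below uses made-up objective values; the function name is hypothetical, and returning `None` when no candidate is feasible is an assumption, since the text does not specify that case.

```python
import numpy as np

def bounded_objective_choice(f_hat, F2, F3, b1, b2):
    """Index of the feasible candidate with the largest f_hat, or None."""
    f_hat, F2, F3 = (np.asarray(a, dtype=float) for a in (f_hat, F2, F3))
    feasible = (F2 >= 0) & (F2 <= b1) & (F3 >= 0) & (F3 <= b2)
    if not feasible.any():
        return None                          # assumed fallback: no feasible point
    masked = np.where(feasible, f_hat, -np.inf)
    return int(np.argmax(masked))

# Illustrative objective values at four candidates (not from the paper).
f_hat = [0.2, 1.5, -0.7, 0.9]
F2 = [0.30, 0.10, 0.05, 0.20]
F3 = [0.8, 0.1, 0.6, 0.4]

choice = bounded_objective_choice(f_hat, F2, F3, b1=0.5, b2=0.2)
none_feasible = bounded_objective_choice(f_hat, F2, F3, b1=0.5, b2=0.05)
```

No normalization is needed here because the bounds are stated directly in the units of $F_2$ and $F_3$.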

Until now, it has been (implicitly) assumed that the static optimization problem at hand is stationary. However, in a variety of problems this is not the case and the function $f(x, t)$ changes with time. The decision making framework allows for modeling such systems in the following way. Let

$$\{ f(x_i, t_i) + n(x_i, t_i) : x_i \in \mathcal{D}, \ t_i \le t, \ \forall i \}$$

be the set of noisy or unreliable past observations until time $t$, where $n(x, t) \sim \mathcal{N}(0, \sigma(t))$ is the zero mean Gaussian "noise" term at time $t$. Now, the deterioration in the past information due to change in $f(x, t)$ can be captured by increasing the variance of the noise term, $\sigma(t)$, with time. For example, a simple linear dynamic can be defined as

$$\frac{d\sigma(t)}{dt} = \eta,$$

where $\eta > 0$ captures the level of stationarity, e.g. a large $\eta$ indicates a rapidly changing system and function $f(x, t)$.
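One way to realize this aging of past observations in a GP model is to give each observation its own noise variance that grows with its age. The sketch below assumes a constant growth rate `eta` and an initial variance `sigma0`; both values and the function names are hypothetical.

```python
import numpy as np

def aged_noise(t_obs, t_now, sigma0=1e-2, eta=0.05):
    """Noise variance of each past observation, growing linearly with its age."""
    age = t_now - np.asarray(t_obs, dtype=float)
    if np.any(age < 0):
        raise ValueError("observation times must not exceed the current time")
    return sigma0 + eta * age

def covariance_with_aging(K, t_obs, t_now, sigma0=1e-2, eta=0.05):
    """Kernel matrix K plus a per-observation, age-dependent noise diagonal."""
    return np.asarray(K, dtype=float) + np.diag(aged_noise(t_obs, t_now, sigma0, eta))

noise = aged_noise([0.0, 5.0, 10.0], t_now=10.0)
C = covariance_with_aging(np.eye(3), [0.0, 5.0, 10.0], t_now=10.0)
```

Older observations then carry less weight in the regression, so the estimate adapts to recent data at a rate controlled by `eta`.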

5.2 Algorithm 

An algorithmic summary of the solution approaches discussed above for a specific set of choices is provided by Algorithm 1, which describes both weighted-sum and bounded objective variants.

Algorithm 1 Optimization with Limited Information
Input: Function domain $X$, GP meta-parameters, objective weights $[w_1, w_2, w_3]$ or bounds $b_1, b_2$, initial data set $(\mathcal{D}, y)$.
Use GP with a Gaussian kernel and specific expected error variances for function $f$ and error function $e$ estimation.
while Search budget available, $1 \le n \le N_{max}$, do
    Sample domain $X$ to obtain $\Theta(n)$. In some cases, $\Theta(n) = \Theta \ \forall n$.
    Estimate $\hat{f}$ and $\hat{e}$ based on observed data $(\mathcal{D}, y)$ on $\Theta(n)$ using GPs.
    Compute variance, $v(x)$, of $\hat{f}$ (5) on $\Theta(n)$ as an estimate of $I(\hat{f})$.
    if Weighted-sum method then
        Next action maximizes a normalized and weighted sum of objectives, as stated in Proposition 6.
    else if Bounded objective method then
        Next action is solution to the constrained problem in Proposition 7.
    end if
    Update the observed data $(\mathcal{D}, y)$.
end while
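A compact sketch of the weighted-sum branch of Algorithm 1 is given next. It is a simplified illustration and not the implementation behind the reported results: the kernel length scale and noise level are assumed, the error objective is replaced by $|\hat{e}|$ evaluated at the candidate itself, the error GP is trained on the prediction errors observed when each point was sampled, and each objective is min-max normalized over the grid.

```python
import numpy as np

def rbf(a, b, length):
    """Gaussian kernel between two sets of 1-D points."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-0.5 * (a - b) ** 2 / length**2)

def gp_predict(X, y, grid, length=0.3, noise=1e-3):
    """Zero-mean GP regression: posterior mean and variance on the grid."""
    C = rbf(X, X, length) + noise * np.eye(len(X))
    K = rbf(grid, X, length)
    mean = K @ np.linalg.solve(C, y)
    var = 1.0 + noise - np.einsum("ij,ji->i", K, np.linalg.solve(C, K.T))
    return mean, np.maximum(var, 0.0)

def minmax(F):
    """Map values to [0, 1]; a constant array maps to zeros."""
    span = F.max() - F.min()
    return (F - F.min()) / span if span > 0 else np.zeros_like(F)

def limited_info_search(f, grid, start, budget, weights=(1.0, 1.0, 1.0)):
    """Weighted-sum loop; returns the sampled grid indices and the peak estimate."""
    idx = [start]
    y = [f(grid[start])]
    err = [0.0]                                # prediction error seen at each sample
    for _ in range(budget):
        X = grid[idx]
        f_hat, var = gp_predict(X, np.array(y), grid)
        e_hat, _ = gp_predict(X, np.array(err), grid)
        score = (weights[0] * minmax(f_hat)
                 - weights[1] * minmax(np.abs(e_hat))
                 + weights[2] * minmax(var))
        score[idx] = -np.inf                   # do not re-sample observed points
        new = int(np.argmax(score))
        y_new = f(grid[new])
        err.append(y_new - f_hat[new])         # error of the estimate before updating
        idx.append(new)
        y.append(y_new)
    f_hat, _ = gp_predict(grid[idx], np.array(y), grid)
    return idx, grid[int(np.argmax(f_hat))]

f = lambda x: np.sin(5.0 * x) / x              # function of Example 1
grid = np.linspace(0.1, 3.9, 381)              # 0.01 spacing on [0.1, 3.9]
idx, x_peak = limited_info_search(f, grid, start=0, budget=11)
```

The sampled locations `grid[idx]` depend on the assumed kernel parameters and on the simplifications above, so they should not be expected to reproduce the sequence reported for Example 1.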



5.3 Numerical Analysis 

Algorithm 1 is illustrated next with multiple numerical examples. Recall
that the main issue here is to solve the optimization problems with
minimum data using active learning. In all examples, a uniform
grid is used to sample the solution space rather than resorting to a more 
sophisticated method since the examples are chosen to be only one or two 
dimensional for visualization purposes. 

Example 1 

The first numerical example aims to visualize the presented framework and algorithm. Hence, the chosen function is only one dimensional, $f(x) = \sin(5x)/x$ on the interval $X = [0.1, 3.9]$. The interval is linearly sampled to obtain a grid with a distance of 0.01 between points, i.e. $\Theta = \{x_i \in X \ \forall i : x_1 = 0.1, x_2 = 0.11, \ldots, x_N = 3.9\}$. A Gaussian kernel with variance 0.1 is chosen for estimating both $f$ and $e$. The weights are equal to one, $w = [1, 1, 1]$, in the weighted-sum method. The bounds are $b_1 = 0.5$ for the error bound and $b_2 = 0.2$ for the bound on maximum variance estimate in the bounded objective method. The initial data consists of a single point, $x = 0.1$.

Figure 4 shows the results based on the normalized weighted-sum method in Proposition 6 after 5 iterations (6 samples in total, together with the initial data point). The variance here is $v(x)$ of the estimated function $\hat{f}$ using data points $\mathcal{D}$. Clearly, the estimated peak is not the one of the real function $f$.

Next, Figure 5 shows that after 11 iterations (12 data points in $\mathcal{D}$), the function and the location of its peak are estimated correctly. The sequence of points selected during the iteration process is:

$\mathcal{D} = \{0.47, 3.22, 1.17, 1.66, 2.43, 2.06, 3.9, 2.83, 3.6, 0.82, 1.42\}$.



Figure 4: Optimization result using the weighted-sum method with 6 data
points.



The amount of information obtained during the iterative optimization
is of particular interest. Figure 6 depicts the mean variance $v$ and entropy
$I$ of the estimated function $\hat{f}$ on $\Theta$ at each iteration step. In this specific
example, the two quantities are very well correlated. Note, however, that
this correlation is a function of the relative weights between information 
collection and other objectives. 

Finally, Figure 7 depicts the results of the bounded objective method with the given bounds. The number of iterations is 11 as before, which gives an opportunity of direct comparison with the weighted-sum method. The sequence of points selected during the iteration process is:

$\mathcal{D} = \{0.47, 3.22, 1.17, 1.66, 2.43, 2.06, 3.9, 2.83, 3.6, 0.82, 1.42\}$.

Figure 6: Mean variance $v$ and entropy $I$ on $\Theta$ at each iteration step.

Example 2

The objective function in the second numerical example is the Goldstein&Price function [8], which is shown in Figure 8 in its inverted form to ensure consistency with the maximization formulation in this paper. The problem domain consists of the two dimensional rectangular region $X = [-2, 2] \times [-2, 2]$, which is linearly sampled to obtain a uniform grid with a 0.05 interval between sample points. A Gaussian kernel with variance 0.5 and 0.1 is chosen for estimating $f$ and $e$, respectively. The weighted-sum method is utilized in Algorithm 1 with the weights $w = [4, 2, 3]$. The search budget is chosen as 50 before stopping the algorithm (for the search space of approx. 6400 samples in the grid). The real global maximum (peak) of the (inverted) Goldstein&Price function is at $(0, -1)$ and the location found by the algorithm using the 50 data points is $(-0.15, -1.05)$. Figure 9 depicts the estimated function, the data points as well as the optimum found. Although the real optimum value is $-3$ (in the inverted version) while the obtained one is $-9.75$, the result is still very satisfactory considering the simple sampling scheme used and the fact that the Goldstein&Price function takes values in a range of 1 million, i.e. the error is less than 0.001 percent of the range. Finally, Figure 10 depicts the mean variance $v$ and entropy $I$ of the estimated function $\hat{f}$ on $\Theta$ at each iteration step.
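The figures quoted in this example can be checked directly against the standard definition of the Goldstein&Price function. The sketch below assumes that standard form and evaluates the inverted function at the true and the reported optimum locations; the grid is used only to estimate the range of values on the domain.

```python
import numpy as np

def goldstein_price(x, y):
    """Standard Goldstein&Price test function (global minimum 3 at (0, -1))."""
    a = 1 + (x + y + 1) ** 2 * (19 - 14 * x + 3 * x**2 - 14 * y + 6 * x * y + 3 * y**2)
    b = 30 + (2 * x - 3 * y) ** 2 * (18 - 32 * x + 12 * x**2 + 48 * y - 36 * x * y + 27 * y**2)
    return a * b

def inverted(x, y):
    """Inverted version used for maximization."""
    return -goldstein_price(x, y)

axis = np.linspace(-2.0, 2.0, 81)            # 0.05 spacing on [-2, 2]
gx, gy = np.meshgrid(axis, axis)
values = inverted(gx, gy)
value_range = values.max() - values.min()    # spread of values on the grid

true_peak = inverted(0.0, -1.0)              # value at the true optimum
found = inverted(-0.15, -1.05)               # value at the reported location
error_percent = 100.0 * abs(true_peak - found) / value_range
```

Evaluating the function at the reported location reproduces the value of about $-9.75$ quoted above, and the error relative to the range on the grid stays below 0.001 percent.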




Figure 8: The inverted Goldstein&Price function [8]. 



Example 3 

The third example uses the same setup as the second one but this time with the (inverted) Branin function [6] shown in Figure 11. The rectangular problem domain $X = [-5, 10] \times [0, 15]$ is sampled uniformly to obtain a grid of points with a 0.2 interval. The real global maxima (peaks) of the (inverted) Branin function are at $(9.4, 2.47)$, $(-\pi, 12.28)$, and $(\pi, 2.28)$, whereas the locations found by the algorithm are $(9, 2.6)$, $(-3.2, 12)$, and $(3, 2.2)$. The values at these locations vary between $-4.3$ and $-0.5$ compared to the real global value of $-0.4$ (of the inverted function). Thus, the algorithm again performs satisfactorily. Figure 12 shows the computed location of one optimum, the data points, as well as the estimated function based on the data points.

Figure 9: Optimization of the inverted Goldstein&Price function [8] using the weighted-sum method with 50 data points.

Figure 10: Mean variance $v$ and entropy $I$ on $\Theta$ at each iteration step.

Figure 11: The inverted Branin function [6].

Figure 12: Optimization of the inverted Branin function [6] using the
weighted-sum method with 50 data points.
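For reference, the Branin function of Example 3 can be written down in a few lines. The sketch assumes the commonly used parameter values of this test function and evaluates the inverted function at the three known optimum locations.

```python
import numpy as np

def branin(x1, x2):
    """Branin test function with its commonly used parameter values."""
    a = 1.0
    b = 5.1 / (4.0 * np.pi**2)
    c = 5.0 / np.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r) ** 2 + s * (1.0 - t) * np.cos(x1) + s

def inverted_branin(x1, x2):
    """Inverted version used for maximization."""
    return -branin(x1, x2)

# The three global optima of the function on [-5, 10] x [0, 15].
optima = [(-np.pi, 12.275), (np.pi, 2.275), (3.0 * np.pi, 2.475)]
peak_values = [inverted_branin(x1, x2) for x1, x2 in optima]
```

All three locations give the same value of about $-0.4$ for the inverted function, which matches the real global value quoted in the example.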






Example 4 

The fourth example is based on the six-hump camel function [7] (see Figure 13) on the domain $X = [-2, 2] \times [-2, 2]$, which is sampled uniformly with a 0.05 interval. All of the parameters are chosen to be the same as before. Figure 14 shows the computed location of two optima, the 50 data points, as well as the estimated function based on the data points. The optimum locations found are $(0, 0.65)$ and $(0.05, -0.6)$ with respective values of 0.98 and 1.06, whereas the real locations are $(-0.09, 0.71)$ and $(0.09, -0.71)$ with the value 1.03.
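The six-hump camel function used here has a simple closed form. The sketch below assumes its standard definition and evaluates the inverted function at the known optimum locations (given to four decimals) and at the first location reported above.

```python
def six_hump_camel(x, y):
    """Standard six-hump camel test function."""
    return (4.0 - 2.1 * x**2 + x**4 / 3.0) * x**2 + x * y + (-4.0 + 4.0 * y**2) * y**2

def inverted_camel(x, y):
    """Inverted version used for maximization."""
    return -six_hump_camel(x, y)

peak_a = inverted_camel(-0.0898, 0.7126)     # first global optimum
peak_b = inverted_camel(0.0898, -0.7126)     # second global optimum
found_a = inverted_camel(0.0, 0.65)          # first location reported in the example
```

The two optima are symmetric about the origin and share the value of about 1.03, while the first reported location evaluates to about 0.98.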

Figure 13: The inverted six-hump camel function.



6 Literature Review 

Decision making with limited information is related to search theory. The 
idea of using information (theory) in this context is hardly new as evidenced 
by the article "A New Look at the Relation Between Information Theory and 
Search Theory" from 1979 [22]. The subject is further studied in [11]. The topic of optimal search is more recently revisited by [35], which contains substantial historical notes and studies problems where the search target distribution in itself is unobservable.

Figure 14: Optimization of the inverted six-hump camel function [7] using the weighted-sum method with 50 data points.

The book [18] provides important and valuable insights into the relationship between information theory, inference, and learning. Measuring information content of experiments using Shannon information is explicitly mentioned and a slightly informal version of the bisection example in Subsection 3.2 is discussed. However, focusing mainly on more traditional coding,
communication, and machine learning topics, the book does not discuss the 
type of decision making problems presented in this paper. 

Learning plays an important role in the presented framework, especially 
regression, which is a classical machine (or statistical) learning method. A 
very good introduction to the subject can be found in A complemen- 
tary and detailed discussion on kernel methods is in [26]. Another relevant 
topic is Bayesian inference [33, 18], which is the foundation of the presented framework. In machine learning literature, Gaussian processes (GPs) are getting increasingly popular due to their various favorable characteristics. The book [23] presents a comprehensive treatment of GPs. Additional relevant works on the subject include [18, 26, 16], which also discuss GP regression.

Convex optimization [4] is a well-understood topic that is often easy to handle even if available information is limited. Optimizing nonconvex functions, however, is still a research subject [12]. It is interesting to note that the method known as kriging in global optimization is almost the same as GP regression in machine learning. The field of stochastic programming focuses on optimization under uncertainty but assumes a certain amount of prior knowledge on the problem at hand and models the uncertainty probabilistically [25]. The popular heuristic method simulated annealing [23] is essentially based on iterative random search. Another popular heuristic scheme, particle swarm optimization [13], is also based on random search, but its distinguishing characteristic is that it is parallel rather than iterative in nature.

Gaussian processes have been recently applied to the area of optimization and regression [5] as well as system identification [32]. While the latter mentions active learning, neither work discusses explicit information quantification or builds a connection with Shannon information theory. The recent articles [15, 34], which utilize GP regression for optimization in a setting similar to the one in this paper and for state-space inference and learning, respectively, do not consider information-theoretic aspects of the problem, either. Likewise, the article [10] on stochastic black box optimization, which considers a problem similar to the one here, does not take into account explicit measurement of information.

The area of active learning or experiment design focuses on data scarcity 
in machine learning and makes use of Shannon information theory among 
other criteria [28]. The paper [17] discusses objective functions which measure
the expected informativeness of candidate measurements within a Bayesian
learning framework. The subsequent study [27] investigates active learning
for GP regression using variance as a (heuristic) confidence measure for test 
point rejection. 

7 Discussion 

The foundation of the approach adopted in this paper is Bayesian inference, where the main idea is to choose an a priori model and update it with actual experimental data observed (see [18, Chap. 2] for a beautiful introductory discussion on the subject). As long as the a priori model is close to the reality (of the problem at hand), this inference methodology works very efficiently as indicated by the numerical examples in Section 5.3. In many cases this background information, which is sometimes referred to as "domain knowledge", is already available. However, in others one has to explore the model domain and learn model meta-parameters in a time scale naturally longer than the one of actual optimization [16].

The GP regression adopted in the presented framework is only one method for function estimation and other, e.g. parametric, methods can easily replace GP for the regression part. In any case, the regression methodology here is consistent with the principle of "Occam's razor", more specifically its interpretation using Kolmogorov complexity [14]. A priori, the optimization problems at hand are more likely to be simple rather than complex to describe, in accordance with the universal distribution [14]. Hence, given a data set it is reasonable to start describing it with the simplest explanation. GP regression already incorporates this line of thinking by relying on a kernel-based approach and making use of the representer theorem [23, Chap. 6.2]. As a visual example, we refer to Figures 4 and 5 for a comparison of function estimates with different sets of available data.

This paper considers a class of problems where data is scarce and ob- 
taining it is costly. Information theory plays an especially important role 
in devising optimal schemes for obtaining new data points (active learning).
The entropy measure from Shannon information theory provides the neces- 
sary metric for this purpose, which quantifies the "exploration" aspect of the 
problem. Using a multi-objective optimization formulation, the presented 
framework allows explicit weighting of exploration vs. exploitation aspects. 
This trade-off is also very similar to one between the well-known depth-first 
vs. breadth-first search algorithms in search theory. 

The amount of information obtained from each data point is different 
here only because a specific a priori general model is utilized to explain 
the observed data (GP regression). Because of this the amount of infor- 
mation obtained is specific to the model. Otherwise, without this Bayesian 
approach, each data point would give the same information (inversely pro- 
portional to the total number of candidate points). 

The illustrative examples discussed are low-dimensional, which makes 
it possible to use grids for sampling. However, in higher dimensions (i.e. 
when the problem is much more "difficult") this "luxury" is not affordable 
and one has to necessarily resort to Monte Carlo methods. In such cases, 
the trade-off between exploration and exploitation is even more emphasized. 
Possible methods to address this issue include "cooling" approaches similar
to those used in simulated annealing, multi-resolution sampling based on 
region of interest or using topological properties of Gaussian mixtures to 
intelligently estimate candidate points based on the current state. 

The optimization approach presented here can also be interpreted from a biological perspective. If an analogy between the decision-maker and a biological organism is established, then the a priori Bayesian model (meta parameters of the GP) that is refined over a long time scale corresponds to evolution of a species in an environment (problem domain). Each individual organism belonging to the species obtains new information to achieve its objective while preserving resources as much as possible. The existing evolutionary basis (GP model) gives them an advantage to find a solution much faster compared to random search. From the perspective of the species, it also makes sense for some of its members to explore the model (meta parameter) domain and further refine it through adaptation. Those with better meta parameters then achieve their objectives even more efficiently and obtain an evolutionary edge in natural selection (assuming competition).

8 Conclusion 

The decision making framework presented in this paper addresses the prob- 
lem of decision making under limited information by taking into account the 
information collection (observation), estimation (regression), and (multi- 
objective) optimization aspects in a holistic and structured manner. The 
methodology is based on Gaussian processes and active learning. Various 
issues such as quantifying information content of new data points using in- 
formation theory, the relationship between information and GP variance as 
well as related approximation and multi-objective optimization schemes are 
discussed. The framework is demonstrated with multiple numerical exam- 
ples. 

The presented framework should be considered mainly as an initial step. 
Future research directions are abundant and include further investigation 
of the exploration-exploitation trade-off, adaptive weighting parameters, 
and random sampling methods for problems in higher dimensional spaces. 
Additional research topics are the relationship of the framework with ge-
netic/evolutionary methods, dynamic control problems, and multi-person
decision making, i.e. game theory.